Luckily for us, fitting a linear model to some data does not require us to iteratively find the best intercept and slope manually. As it turns out, R can do this for us, much more precisely and very fast!
Let’s explore how to do this using a real-life dataset taken from the Ecdat package, which includes many economics-related datasets. In this example, we will use the Caschool dataset, which contains the average test scores of 420 elementary schools in California along with some additional information.
We can explore which variables are included in the dataset using the names() function:
library("Ecdat") # Attach the Ecdat library
names(Caschool) # Display the variables of the Caschool dataset
## [1] "distcod" "county" "district" "grspan" "enrltot" "teachers"
## [7] "calwpct" "mealpct" "computer" "testscr" "compstu" "expnstu"
## [13] "str" "avginc" "elpct" "readscr" "mathscr"
For each variable in the dataset, basic summary statistics can be obtained by calling summary()
summary(Caschool[, c("testscr", "str", "avginc")])
## testscr str avginc
## Min. :605.5 Min. :14.00 Min. : 5.335
## 1st Qu.:640.0 1st Qu.:18.58 1st Qu.:10.639
## Median :654.5 Median :19.72 Median :13.728
## Mean :654.2 Mean :19.64 Mean :15.317
## 3rd Qu.:666.7 3rd Qu.:20.87 3rd Qu.:17.629
## Max. :706.8 Max. :25.80 Max. :55.328
Suppose a policymaker is interested in the following linear model:
\[testscr_i = \beta_0 + \beta_1 \times (str)_i + \epsilon_i\] where \((testscr)_i\) is the average test score for a given school \(i\) and \((str)_i\) is the Student/Teacher Ratio (i.e. the average number of students per teacher) in the same school \(i\). We can think of \(\beta_0\) and \(\beta_1\) as the intercept and the slope of the regression line.
The subscript \(i\) indexes all unique elementary schools (\(i \in \{1, 2, 3, \dots 420\}\)) and \(\epsilon_i\) is the error, or residual, of the regression. (Remember that our procedure for finding the line of best fit is to minimize the sum of squared residuals (SSR)).
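To make the SSR idea concrete, here is a minimal sketch that computes the sum of squared residuals for one candidate intercept and slope. The numbers are made up for illustration, not taken from the Caschool data:

```r
# Toy data: 5 observations (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# A candidate intercept (b0) and slope (b1)
b0 <- 0
b1 <- 2

# Residuals: observed y minus the value predicted by the candidate line
resid <- y - (b0 + b1 * x)

# Sum of squared residuals (SSR): the quantity OLS minimizes
ssr <- sum(resid^2)
ssr  # approximately 0.11
```

Trying other values of `b0` and `b1` will give a larger SSR here; `lm()` finds the pair that makes this sum as small as possible.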
At this point you should step back and take a second to think about what you believe the relation between a school’s test scores and student/teacher ratio will be. Do you believe that, in general, a high student/teacher ratio will be associated with higher-than-average test scores for the school? Do you think that the number of students per teacher will impact results in any way?
Let’s find out! As always, we will start by plotting the data to inspect it visually (don’t worry if the syntax doesn’t make much sense right now, we will come back to it very soon):
plot(formula = testscr ~ str,
data = Caschool,
xlab = "Student/Teacher Ratio",
ylab = "Average Test Score", pch = 21, col = 'blue')
Can you spot a trend in the data? In your opinion, what would the line of best fit look like? Would it be upward or downward sloping? Let’s ask R!
The lm() function

We will use the built-in lm() function to estimate the coefficients \(\beta_0\) and \(\beta_1\) using the data at hand. The lm() function typically takes only 2 arguments:
lm(formula, data)
formula is the description of the model which we want R to estimate for us. Its syntax is very simple: Y ~ X (more generally, DependentVariable ~ IndependentVariables). You can think of the tilde operator ~ as the equal sign in your model equation. An intercept is included by default, so you do not have to ask for it in formula. For example, the simple model \(income = \beta_0 + \beta_1 \cdot age\) can be written as income ~ age. You can also ask R to estimate a multivariate regression such as \(income = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot isWoman\) by simply separating all variables on the right-hand side of the equation with the + operator, like this: income ~ age + isWoman. A formula can also be written between quotation marks: "Y ~ X".
data is simply the data.frame containing the variables in the model.
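As a quick illustration of the formula syntax (using the built-in mtcars dataset as a stand-in, so the snippet runs without the Ecdat package):

```r
# A formula is an R object in its own right
f <- mpg ~ wt          # "mpg modeled as a function of wt"
class(f)               # "formula"

# The same model written as a character string; lm() coerces it for us
lm(formula = "mpg ~ wt", data = mtcars)

# A multivariate formula: separate the regressors with `+`
lm(formula = mpg ~ wt + hp, data = mtcars)
```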
In the context of our example, the function call is therefore:
lm(formula = "testscr ~ str",
data = Caschool)
##
## Call:
## lm(formula = "testscr ~ str", data = Caschool)
##
## Coefficients:
## (Intercept) str
## 698.93 -2.28
As we can see, R automatically prints its estimates for the intercept and slope coefficients, \(\hat{\beta_0} =\) 698.93 and \(\hat{\beta_1} =\) -2.28. The relationship between a school’s Student/Teacher Ratio and its average test score is negative: according to this fit, each additional student per teacher is associated with an average test score about 2.28 points lower.
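If you just need the numbers, coef() extracts the estimated coefficients as a named vector. This sketch uses the built-in mtcars dataset as a stand-in for Caschool; the same calls work on any lm fit:

```r
# Fit a simple regression on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)

# Named vector of estimates: (Intercept) and wt
coef(fit)

# Pull out a single estimate by name
coef(fit)["(Intercept)"]   # about 37.29
coef(fit)["wt"]            # about -5.34
```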
For reasons that are outside the scope of this class, running a linear regression in R is typically a two-step process. You first assign the output of the lm() call to an object and then call a second function (for our purposes, mainly summary()) on the resulting object. In practice, this looks like this:
# assign lm() output to some object `fit_california`
fit_california <- lm(formula = "testscr ~ str", data = Caschool)
# ask R for the regression summary
summary(fit_california)
##
## Call:
## lm(formula = "testscr ~ str", data = Caschool)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.727 -14.251 0.483 12.822 48.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9330 9.4675 73.825 < 2e-16 ***
## str -2.2798 0.4798 -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
Again, we recognize our intercept and slope estimates from before, alongside some other numbers and indications. This output is called a regression table, and you will be able to decipher it by the end of this course.
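The summary object is itself a list, so individual pieces of the regression table can be pulled out programmatically (again sketched on the built-in mtcars data, as a stand-in for Caschool):

```r
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

s$r.squared      # the Multiple R-squared from the table
s$sigma          # the residual standard error
s$coefficients   # matrix of Estimate, Std. Error, t value, Pr(>|t|)
```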
We can also use our lm fit to draw the regression line on top of our initial scatterplot, using the following syntax:
plot(formula = testscr ~ str,
data = Caschool,
xlab = "Student/Teacher Ratio",
ylab = "Average Test Score", pch = 21, col = 'blue') # same plot as before
abline(fit_california, col = 'red') # add regression line
As you probably expected, a school’s Student/Teacher Ratio and its average test results are slightly negatively correlated: the regression line slopes downward.
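Beyond drawing the line, a fitted model can generate predictions for new values of the regressor via predict(). The sketch below uses mtcars so it runs on its own; the same pattern works with fit_california and new values of str:

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Predict the outcome for two hypothetical regressor values
new_data <- data.frame(wt = c(2.5, 3.5))
predict(fit, newdata = new_data)

# fitted() returns the model's predictions for the original observations
head(fitted(fit))
```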
With these tools in place, it is fairly straightforward to estimate models with more than one explanatory variable, for example adding families’ average income to our previous model:
\[testscr_i = \beta_0 + \beta_1 (str)_i + \beta_2 (avginc)_i + \epsilon_i\] We can incorporate this new variable into our model by simply adding it to our formula:
fit_multivariate <- lm(formula = "testscr ~ str + avginc", data = Caschool)
summary(fit_multivariate)
##
## Call:
## lm(formula = "testscr ~ str + avginc", data = Caschool)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.608 -9.052 0.707 9.259 31.898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 638.72915 7.44908 85.746 <2e-16 ***
## str -0.64874 0.35440 -1.831 0.0679 .
## avginc 1.83911 0.09279 19.821 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.35 on 417 degrees of freedom
## Multiple R-squared: 0.5115, Adjusted R-squared: 0.5091
## F-statistic: 218.3 on 2 and 417 DF, p-value: < 2.2e-16
Although it is quite cumbersome and not typical to visualize multivariate regressions, with 2 explanatory variables we can still do so using a (2-dimensional) regression plane.
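One base-R way to sketch such a plane is to predict the outcome over a grid of values for the two regressors and pass the result to persp(). The snippet below uses mtcars as a stand-in; the same idea carries over directly to str and avginc:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Build a grid over the two explanatory variables
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 25)
hp_grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 25)

# Predicted outcome at every grid point: a 25 x 25 matrix
z <- outer(wt_grid, hp_grid,
           function(wt, hp) predict(fit, newdata = data.frame(wt = wt, hp = hp)))

# Draw the regression plane
persp(wt_grid, hp_grid, z,
      xlab = "wt", ylab = "hp", zlab = "Predicted mpg",
      theta = 30, phi = 20)
```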
While you explore this plot, ask yourself the following question: if you could only choose one of the two explanatory variables in our model (that is, either \(str\) or \(avginc\)) to predict the value of a given school’s average test score, which one would you choose? Why? Discuss this with your classmates.